1 Introduction

The purpose of this script is to fit Cubist and Random Forest models to the training data set and generate predictions on the test data set.

For the training dataset, the response period measures the crime rates for Jan 1, 2018 - Dec 31, 2018. We label this script with the index date 2017-12-31: the index date is the observation date on which the predictors are known, just before the training response period begins.

The base date in this document is 2018-12-31, the last date before the start of the test period. For the test dataset, the response period measures the crime rates for Jan 1, 2019 - Dec 31, 2019.

Both models will be fitted on the training dataset to predict the response of the test dataset.

grid_cellsize = 490  # Hard-coded

loadRDSModel = TRUE  # Skips Cubist tuning entirely; the saved model is loaded instead.

loadRandomForestModel = TRUE  # Skips Random Forest tuning entirely; the saved model is loaded instead.

tuningFull = TRUE  # Tune over the full grid (TRUE = slow) or the light grid (FALSE = faster). Only applies when the corresponding load flag is FALSE.

all_data_on_grid_training_file = paste0(data_dir, "EXP_ALL_DATA_ON_GRID_490_2017-12-31.geojson")

all_data_on_grid_test_file = paste0(data_dir, "EXP_ALL_DATA_ON_GRID_490_2018-12-31.geojson")

To generate the training and test data sets, we load the grid files and strip out non-predictor and redundant columns. Since we are predicting crime frequencies, we strip out the remaining crime outcome columns, and since the models do not require the grid geometries, we drop the geometry column.
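The column-stripping step can be sketched in base R. The column names below are hypothetical stand-ins (the actual grid files use the project's own naming); only the pattern of separating the response and dropping non-predictor columns is the point:

```r
# Hypothetical stand-in for one loaded grid file: predictor columns,
# the response, a redundant crime-count column, and a geometry column.
grid_df <- data.frame(
  id_grid      = 1:4,
  dist_to_road = c(120, 340, 80, 510),   # predictor
  pop_density  = c(5.2, 1.1, 9.8, 0.4),  # predictor
  crime_count  = c(3, 0, 7, 1),          # redundant with the response
  crime_freq   = c(0.6, 0.0, 1.4, 0.2),  # response
  geometry     = I(vector("list", 4))    # placeholder for sf geometry
)

# Separate the response, then drop non-predictor / redundant columns.
Y_on_grid_train      <- grid_df$crime_freq
drop_cols            <- c("id_grid", "crime_count", "crime_freq", "geometry")
X_vars_on_grid_train <- grid_df[, setdiff(names(grid_df), drop_cols)]

print(names(X_vars_on_grid_train))
```

The same pattern is applied to the test grid file with the 2018-12-31 base date.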

2 Cubist Model - 2019

We train the Cubist model using caret over a grid of 10, 50, and 100 committees and 0, 1, 5, and 9 neighbors. The selection criterion is the model with the lowest cross-validated RMSE.

set.seed(1027)

cubistControlFull <- trainControl(method = "cv", selectionFunction = "best")
tuneGridFull <- expand.grid(committees = c(10, 50, 100),
                            neighbors  = c(0, 1, 5, 9))

cubistControlLight <- trainControl(method = "cv", selectionFunction = "best", verboseIter = TRUE)
tuneGridLight <- expand.grid(committees = c(100),
                             neighbors  = c(0))

rdsFileName = "cubistTune_train_2017-12-31.rds"

if(loadRDSModel == FALSE ){
  
   if(tuningFull == TRUE)
   {
        cubistControl = cubistControlFull
        cubistGrid = tuneGridFull
   } else  {
        cubistControl = cubistControlLight
        cubistGrid = tuneGridLight
   }

    (cubistTune = caret::train( x = X_vars_on_grid_train, 
                          y = Y_on_grid_train , 
                          method = "cubist",
                          tuneGrid = cubistGrid ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          trControl = cubistControl ) )
  
    saveRDS(cubistTune, file = rdsFileName)
  
} else {
  
   cubistTune = readRDS(rdsFileName)
}

cubistTune
## Cubist 
## 
## 19782 samples
##    48 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 17803, 17804, 17804, 17804, 17804, 17804, ... 
## Resampling results across tuning parameters:
## 
##   committees  neighbors  RMSE       Rsquared   MAE      
##    10         0          0.8035202  0.6668301  0.2261010
##    10         1          0.9087816  0.5859605  0.2974644
##    10         5          0.8149135  0.6500482  0.2754751
##    10         9          0.8052677  0.6580938  0.2720496
##    50         0          0.8082655  0.6635980  0.2257975
##    50         1          0.9131579  0.5790220  0.2976213
##    50         5          0.8191995  0.6454334  0.2753070
##    50         9          0.8075419  0.6555289  0.2716010
##   100         0          0.8106496  0.6615399  0.2261427
##   100         1          0.9149556  0.5775151  0.2977956
##   100         5          0.8214982  0.6433496  0.2757077
##   100         9          0.8097738  0.6535768  0.2719090
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 10 and neighbors = 0.

We skip the Cubist dotplot of the splits: it takes a long time to render and is not especially informative here.

Next, we consider the prediction on the test set.

We use PAI and AUC to measure the quality of the hotspot predictions, defining hotspots as the top 100 grid cells by predicted crime frequency.

## [1] "Area in miles^2 of Raleigh Buffered at 100 feet: 156.116262020913 miles"
## Loading required namespace: raster

3 Random Forest Model

Now we evaluate the Random Forest model's predictive accuracy.

set.seed(1027)

rfControlFull <- trainControl(method = "cv", selectionFunction = "best", verboseIter = TRUE)
rfTuneGridFull <- expand.grid(.mtry = seq(2, 20, by = 4),
                              .min.node.size = c(5, 10),
                              .splitrule = "variance")

rfControlLight <- trainControl(method = "cv", selectionFunction = "best", verboseIter = TRUE)
rfTuneGridLight <- expand.grid(.mtry = c(5, 10),
                               .min.node.size = c(5),
                               .splitrule = "variance")
rfRdsFileName = "randomForestTune_train_2017-12-31.rds"

if(loadRandomForestModel == FALSE ){
  
   if(tuningFull == TRUE)
   {
        print("tuningFull is TRUE")
        rfControl = rfControlFull
        rfGrid    = rfTuneGridFull
   } else  {
        print("tuningFull is FALSE")
        rfControl = rfControlLight
        rfGrid    = rfTuneGridLight
   }
  
    (randomForestTune = caret::train( x = X_vars_on_grid_train, 
                          y = Y_on_grid_train , 
                          method = "ranger",
                          tuneGrid = rfGrid ,
                          verbose = FALSE,
                          metric = "RMSE" ,
                          importance = "impurity" ,
                          num.trees = 1000, 
                          trControl = rfControl ) )
  
    saveRDS(randomForestTune, file = rfRdsFileName)
  
} else {
  
   randomForestTune = readRDS(rfRdsFileName)
}

The tuning results are shown below. There is not much variation in the MAE or R-squared once mtry is greater than 2.

## Random Forest 
## 
## 19782 samples
##    48 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 17803, 17804, 17804, 17804, 17804, 17804, ... 
## Resampling results across tuning parameters:
## 
##   mtry  min.node.size  RMSE       Rsquared   MAE      
##    2     5             0.9390429  0.6479553  0.3481696
##    2    10             0.9444427  0.6415365  0.3490212
##    6     5             0.8174178  0.6707519  0.2858303
##    6    10             0.8224436  0.6665112  0.2870013
##   10     5             0.8084662  0.6690664  0.2810804
##   10    10             0.8137362  0.6657344  0.2824779
##   14     5             0.8072730  0.6667083  0.2809888
##   14    10             0.8116187  0.6648430  0.2815180
##   18     5             0.8098978  0.6630919  0.2814671
##   18    10             0.8149780  0.6609128  0.2819589
## 
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 14, splitrule = variance
##  and min.node.size = 5.

Let’s use PAI and weighted ROC to measure the quality of the hotspot predictions.
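One way to score the ranked predictions is the area under the crime-capture curve (cumulative crime share against cumulative area share, with cells ranked by predicted frequency), computed with the trapezoidal rule. This is a sketch on synthetic data; the actual script's weighted ROC may be computed differently:

```r
# AUC of the crime-capture curve via the trapezoidal rule.
# Synthetic data for illustration only.
set.seed(1027)
pred   <- rexp(500)                  # predicted crime frequency per cell
actual <- rpois(500, lambda = pred)  # observed crime counts per cell

ord       <- order(pred, decreasing = TRUE)       # rank cells by prediction
cum_area  <- seq_along(ord) / length(ord)         # cumulative area share
cum_crime <- cumsum(actual[ord]) / sum(actual)    # cumulative crime share

# Trapezoidal rule, with the origin (0, 0) prepended.
x   <- c(0, cum_area)
y   <- c(0, cum_crime)
auc <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
```

An AUC of 0.5 corresponds to ranking cells at random; values near 1 mean most crime falls in the highest-ranked cells.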

Comparison of Random Forest and Cubist Models

| Rank | id_grid (RF) | Cum. Crime Freq (RF) | Prop. Cum. Area (RF) | Prop. Cum. Crime Freq (RF) | PAI (RF) | RRI (RF) | id_grid (Cubist) | Cum. Crime Freq (Cubist) | Prop. Cum. Area (Cubist) | Prop. Cum. Crime Freq (Cubist) | PAI (Cubist) | RRI (Cubist) |
|------|--------------|----------------------|----------------------|----------------------------|----------|----------|------------------|--------------------------|--------------------------|--------------------------------|--------------|--------------|
| 1  | 10922 | 50  | 0.000 | 0.008 | 144.116 | 0.830 | 10922 | 50  | 0.000 | 0.008 | 144.116 | 1.217 |
| 2  | 12189 | 91  | 0.000 | 0.014 | 131.146 | 0.891 | 12189 | 91  | 0.000 | 0.014 | 131.146 | 1.282 |
| 3  | 12401 | 138 | 0.000 | 0.022 | 132.587 | 0.829 | 12401 | 138 | 0.000 | 0.022 | 132.587 | 1.227 |
| 4  | 10923 | 173 | 0.000 | 0.028 | 124.660 | 0.851 | 11101 | 154 | 0.000 | 0.024 | 110.969 | 1.334 |
| 5  | 11101 | 189 | 0.000 | 0.030 | 108.952 | 0.921 | 10923 | 189 | 0.000 | 0.030 | 108.952 | 1.255 |
| 6  | 11977 | 206 | 0.000 | 0.033 | 98.960  | 0.956 | 12613 | 209 | 0.000 | 0.033 | 100.401 | 1.272 |
| 7  | 12649 | 230 | 0.000 | 0.037 | 94.705  | 0.955 | 11977 | 226 | 0.000 | 0.036 | 93.058  | 1.292 |
| 8  | 14140 | 251 | 0.000 | 0.040 | 90.433  | 0.964 | 12649 | 250 | 0.000 | 0.040 | 90.073  | 1.266 |
| 9  | 10710 | 260 | 0.000 | 0.041 | 83.267  | 1.016 | 14140 | 271 | 0.000 | 0.043 | 86.790  | 1.252 |
| 10 | 13928 | 269 | 0.001 | 0.043 | 77.534  | 1.059 | 13928 | 280 | 0.001 | 0.045 | 80.705  | 1.282 |

4 Model Comparison

## [1] "AUC for Cubist Model is: 0.88307614428645"
## [1] "AUC for Random Forest is: 0.893707917775985"
Model PAI Comparison Test 2019

| Rank | PAI (RF) | PAI (Cubist) | Cum. # Assaults (RF) | Cum. # Assaults (Cubist) | Cum. Assault % (RF) | Cum. Assault % (Cubist) | Cum. Area % |
|------|----------|--------------|----------------------|--------------------------|---------------------|-------------------------|-------------|
| 1   | 144.12 | 144.12 | 50   | 50   | 0.80  | 0.80  | 0.01 |
| 5   | 108.95 | 108.95 | 189  | 189  | 3.01  | 3.01  | 0.03 |
| 25  | 60.26  | 58.00  | 507  | 488  | 8.06  | 7.76  | 0.13 |
| 100 | 37.24  | 37.27  | 1282 | 1283 | 20.38 | 20.40 | 0.55 |
| 200 | 26.46  | 26.76  | 1828 | 1849 | 29.07 | 29.40 | 1.10 |